Explore CONUS404 Dataset#
This dataset was created by extracting specified variables from a collection of wrf2d output files, rechunking them to better support data extraction for a variety of use cases, and adding CF conventions to allow easier analysis, visualization, and data extraction using Xarray and HoloViz.
import os
os.environ['USE_PYGEOS'] = '0'
import fsspec
import xarray as xr
import hvplot.xarray
import intake
import metpy
import cartopy.crs as ccrs
1) Open dataset from Intake Catalog#
- Select the on-prem dataset from /caldera if running on-prem (Denali/Tallgrass)
- Select the cloud/osn object store data if running elsewhere
# open the hytest data intake catalog
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
['conus404-catalog',
'conus404-drb-eval-tutorial-catalog',
'nhm-v1.0-daymet-catalog',
'nhm-v1.1-c404-bc-catalog',
'nhm-v1.1-gridmet-catalog',
'nwis-streamflow-usgs-gages-onprem',
'nwis-streamflow-usgs-gages-cloud',
'nwm21-streamflow-usgs-gages-onprem',
'nwm21-streamflow-usgs-gages-cloud',
'nwm21-streamflow-cloud',
'nwm21-scores',
'lcmap-cloud',
'rechunking-tutorial-cloud']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem',
'conus404-hourly-cloud',
'conus404-hourly-osn',
'conus404-daily-diagnostic-onprem',
'conus404-daily-diagnostic-cloud',
'conus404-daily-diagnostic-osn',
'conus404-daily-onprem',
'conus404-daily-cloud',
'conus404-daily-osn',
'conus404-monthly-onprem',
'conus404-monthly-cloud',
'conus404-monthly-osn']
## Select the dataset you want to read into your notebook and preview its metadata
dataset = 'conus404-hourly-osn'
cat[dataset]
conus404-hourly-osn:
args:
consolidated: true
storage_options:
anon: true
client_kwargs:
endpoint_url: https://renc.osn.xsede.org
requester_pays: false
urlpath: s3://rsignellbucket2/hytest/conus404/conus404_hourly_202302.zarr
description: 'CONUS404 Hydro Variable subset, 40 years of hourly values. These files
were created wrfout model output files (see ScienceBase data release for more
details: https://www.sciencebase.gov/catalog/item/6372cd09d34ed907bf6c6ab1). This
dataset is stored on AWS S3 cloud storage in a requester-pays bucket. You can
work with this data for free in any environment (there are no egress fees).'
driver: intake_xarray.xzarr.ZarrSource
metadata:
catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs
2) Set Up AWS Credentials (Optional)#
This notebook reads data from the OSN pod by default, which is object store data on a high-speed internet connection that is free to access from any environment. If you change this notebook to use one of the CONUS404 datasets stored on S3 (options ending in -cloud), you will be pulling data from a requester-pays S3 bucket. This means you must set up your AWS credentials, or you will not be able to load the data. Please note that reading the -cloud data from S3 may incur charges if you are reading data outside of the us-west-2 region or running the notebook outside of the cloud altogether. If you would like to access one of the -cloud options, uncomment and run the following code snippet to set up your AWS credentials. You can find more info about this AWS helper function here.
# uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'default'
# %run ../environment_set_up/Help_AWS_Credentials.ipynb
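If you prefer to set credentials by hand rather than via the helper notebook, a minimal sketch looks like the following (the profile name `'default'` is an assumption; substitute whichever profile in your `~/.aws/credentials` holds your keys):

```python
import os

# Hypothetical profile name -- replace with the profile in your ~/.aws/credentials
os.environ['AWS_PROFILE'] = 'default'
# Reading from us-west-2 avoids cross-region egress charges for the -cloud datasets
os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'

# For requester-pays buckets, s3fs also needs this flag in storage_options
storage_options = {'requester_pays': True, 'profile': os.environ['AWS_PROFILE']}
print(storage_options['requester_pays'])
```

These `storage_options` would be passed through to `s3fs` when opening one of the `-cloud` datasets; the `-osn` datasets need no credentials at all.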
3) Parallelize with Dask#
Some of the steps we will take are aware of parallel clustered compute environments
using dask. We’re going to start a cluster now so that future steps can take advantage
of this ability.
This is an optional step, but it speeds up data loading significantly, especially when accessing data from the cloud.
We have documentation on how to start a Dask Cluster in different computing environments here.
%run ../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari/ESIP, replace the above
## path name with a helper appropriate to your compute environment. Examples:
# %run ../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Desktop.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_PangeoCHS.ipynb
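If none of these helper notebooks matches your environment, a plain `LocalCluster` works anywhere `dask.distributed` is installed. A minimal sketch (the worker counts are arbitrary; size them to your machine):

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1)  # adjust for your hardware
client = Client(cluster)

# Quick sanity check that tasks actually run on the cluster
result = client.submit(lambda x: x + 1, 41).result()
print(result)

client.close()
cluster.close()
```

The `client` and `cluster` objects created this way behave the same as the ones the helper notebooks create, so the rest of the notebook runs unchanged.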
The 'cluster' object can be used to adjust cluster behavior. i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster. i.e. 'client.submit(func)'
The link to view the client dashboard is:
> https://nebari.esipfed.org/gateway/clusters/dev.f2fe87a6b9284e27a26f9a3089da7d60/status
4) Explore the dataset#
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
print("done")
# Examine the grid data structure for SNOW:
ds.SNOW
Reading conus404-hourly-osn metadata...
done
<xarray.DataArray 'SNOW' (time: 368064, y: 1015, x: 1367)>
dask.array<open_dataset-986a06d6b28d681a9e9ca356f92aad3eSNOW, shape=(368064, 1015, 1367), dtype=float32, chunksize=(144, 175, 175), chunktype=numpy.ndarray>
Coordinates:
lat (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
lon (y, x) float32 dask.array<chunksize=(175, 175), meta=np.ndarray>
* time (time) datetime64[ns] 1979-10-01 ... 2021-09-25T23:00:00
* x (x) float64 -2.732e+06 -2.728e+06 ... 2.728e+06 2.732e+06
* y (y) float64 -2.028e+06 -2.024e+06 ... 2.024e+06 2.028e+06
metpy_crs object Projection: lambert_conformal_conic
Attributes:
description: SNOW WATER EQUIVALENT
grid_mapping: crs
long_name: Snow water equivalent
units: kg m-2

Looks like this dataset is organized in three coordinates (x, y, and time). There is a
metpy_crs attached:
crs = ds['SNOW'].metpy.cartopy_crs
crs
<cartopy.crs.LambertConformal object at 0x7fdea88ae7d0>
Example A: Load the entire spatial domain for a variable at a specific time step#
%%time
da = ds.SNOW_ACC_NC.sel(time='2009-12-24 00:00').load()
### NOTE: the `load()` is dask-aware, so will operate in parallel if
### a cluster has been started.
CPU times: user 441 ms, sys: 52.2 ms, total: 494 ms
Wall time: 13 s
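To see what `.load()` is doing, here is a tiny synthetic sketch (assuming `dask` is installed; the array is made up, not CONUS404 data): a chunked DataArray stays lazy until loaded.

```python
import numpy as np
import xarray as xr

# A chunked DataArray is backed by a lazy dask graph, like ds.SNOW above
da = xr.DataArray(np.arange(6.0), dims='time').chunk({'time': 3})
print(type(da.data).__name__)      # a dask Array while lazy

loaded = da.load()                 # computes the graph, in parallel if a cluster is up
print(type(loaded.data).__name__)  # now a plain numpy ndarray
```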
da.hvplot.quadmesh(x='lon', y='lat', rasterize=True, geo=True, tiles='OSM', cmap='viridis').opts('Image', alpha=0.5)
Example B: Load a time series for a variable at a specific grid cell for a specified time range#
SIDE NOTE
To identify a point, we start with its lat/lon coordinates. But the
data is in Lambert Conformal Conic coordinates, so we need to re-project/transform
the point using the built-in crs we examined earlier:
lat,lon = 39.978322,-105.2772194
x, y = crs.transform_point(lon, lat, src_crs=ccrs.PlateCarree())
print(x,y) # these vals are in LCC
-618215.7570892666 121899.89692719541
%%time
da = ds.PREC_ACC_NC.sel(x=x, y=y, method='nearest').sel(time=slice('2013-01-01 00:00','2013-12-31 00:00')).load()
CPU times: user 156 ms, sys: 4.45 ms, total: 160 ms
Wall time: 5.78 s
da.hvplot(x='time', grid=True)
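A common next step with an hourly accumulation series like this is aggregating to daily totals with `resample`. A synthetic sketch (the values here are made up, not CONUS404 data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two days of fake hourly precipitation accumulations (1 mm every hour)
time = pd.date_range('2013-01-01', periods=48, freq='h')
hourly = xr.DataArray(np.ones(48), coords={'time': time},
                      dims='time', name='PREC_ACC_NC')

daily = hourly.resample(time='1D').sum()  # hourly accumulations -> daily totals
print(daily.values)  # [24. 24.]
```

The same `.resample(...)` call works on the dask-backed `da` loaded above, and runs in parallel on the cluster.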
Stop cluster#
client.close(); cluster.shutdown()